100 research outputs found

    A practical approach to language complexity: a wikipedia case study

    In this paper we present a statistical analysis of English texts from Wikipedia. We try to address the issue of language complexity empirically by comparing the Simple English Wikipedia (Simple) to comparable samples of the main English Wikipedia (Main). Simple is supposed to use simplified language with a limited vocabulary, and editors are explicitly requested to follow this guideline, yet in practice the vocabulary richness of both samples is at the same level. Detailed analysis of longer units (n-grams of words and part-of-speech tags) shows that the language of Simple is less complex than that of Main primarily due to the use of shorter sentences, as opposed to drastically simplified syntax or vocabulary. Comparing the two language varieties by the Gunning readability index supports this conclusion. We also report on the topical dependence of language complexity: the language is more advanced in conceptual articles than in person-based (biographical) and object-based articles. Finally, we investigate the relation between conflict and language complexity by analyzing the content of the talk pages associated with controversial and peacefully developing articles, concluding that controversy has the effect of reducing language complexity.
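The Gunning readability index mentioned in this abstract is commonly computed as 0.4 × (average sentence length + percentage of words with three or more syllables). The sketch below is only an illustration of that formula, not the authors' pipeline: the function names are hypothetical, and both the sentence splitter and the vowel-group syllable counter are rough assumptions.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def gunning_fog(text: str) -> float:
    """Gunning fog index: 0.4 * (words per sentence + 100 * complex-word share)."""
    # Naive sentence split on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # "Complex" words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    average_sentence_length = len(words) / len(sentences)
    complex_share = 100.0 * len(complex_words) / len(words)
    return 0.4 * (average_sentence_length + complex_share)
```

With the same words, splitting a text into shorter sentences lowers the score, which matches the abstract's point that sentence length alone can account for a difference in the index.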

    Testing the robustness of laws of polysemy and brevity versus frequency

    The pioneering research of G.K. Zipf on the relationship between word frequency and other word features led to the formulation of various linguistic laws. Here we focus on two of them: the meaning-frequency law, i.e. the tendency of more frequent words to be more polysemous, and the law of abbreviation, i.e. the tendency of more frequent words to be shorter. We evaluate the robustness of these laws in contexts where, to our knowledge, they have not yet been explored. The recovery of the laws in these new conditions provides support for the hypothesis that they originate from abstract mechanisms. Peer reviewed. Postprint (author's final draft).

    Languages cool as they expand: Allometric scaling and the decreasing need for new words

    We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which have a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This “cooling pattern” forms the basis of a third statistical regularity, which, unlike the Zipf and Heaps laws, is dynamical in nature.
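The allometric scaling relation between corpus size and vocabulary size is usually written as a power law, V ∝ N^b, where an exponent b below 1 means each additional token is less likely to be a new word. The sketch below estimates b by a least-squares fit in log-log space. It is an illustration only: the function names are hypothetical, and sampling the growth curve along a single token stream is an assumption, not the procedure used in the study.

```python
import math


def vocabulary_growth(tokens, step):
    """Return (corpus_size, vocabulary_size) pairs sampled every `step` tokens."""
    seen = set()
    points = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if position % step == 0:
            points.append((position, len(seen)))
    return points


def scaling_exponent(points):
    """Least-squares slope of log(vocabulary size) against log(corpus size)."""
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(vocab) for _, vocab in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A stream in which every token is new gives an exponent of 1, while a stream that keeps reusing a fixed vocabulary gives an exponent near 0; natural-language corpora typically fall in between.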

    Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death

    We analyze the dynamic properties of 10^7 words recorded in English, Spanish and Hebrew over the period 1800–2008 in order to gain insight into the coevolution of language and culture. We report language-independent patterns useful as benchmarks for theoretical models of language evolution. A significantly decreasing (increasing) trend in the birth (death) rate of words indicates a recent shift in the selection laws governing word use. For new words, we observe a peak in the growth-rate fluctuations around 40 years after introduction, consistent with the typical entry time into standard dictionaries and the human generational timescale. Pronounced changes in the dynamics of language during periods of war show that word correlations, occurring across time and between words, are largely influenced by coevolutionary social, technological, and political factors. We quantify cultural memory by analyzing the long-term correlations in the use of individual words using detrended fluctuation analysis. Comment: Version 1: 31 pages, 17 figures, 3 tables. Version 2 is streamlined, eliminates substantial material and incorporates referee comments: 19 pages, 14 figures, 3 tables.
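Detrended fluctuation analysis (DFA) measures long-term correlations by integrating a series, removing a local linear trend within windows of size n, and checking how the residual fluctuation F(n) grows with n. The sketch below is a minimal first-order DFA in pure Python. The function names, window sizes and synthetic test series are assumptions chosen for illustration, not the settings used in the study.

```python
import itertools
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_exponent(series, window_sizes):
    """First-order DFA: slope of log F(n) against log n."""
    mean = sum(series) / len(series)
    # Profile: cumulative sum of the mean-subtracted series.
    profile = list(itertools.accumulate(value - mean for value in series))
    log_n, log_f = [], []
    for n in window_sizes:
        t = list(range(n))
        residuals = []
        # Non-overlapping windows; a trailing partial window is ignored.
        for start in range(0, len(profile) - n + 1, n):
            segment = profile[start:start + n]
            slope, intercept = linear_fit(t, segment)
            residuals.append(
                sum((y - (slope * x + intercept)) ** 2 for x, y in zip(t, segment)) / n
            )
        log_n.append(math.log(n))
        # F(n) is the root of the mean squared residual, hence the factor 0.5.
        log_f.append(0.5 * math.log(sum(residuals) / len(residuals)))
    alpha, _ = linear_fit(log_n, log_f)
    return alpha


# Synthetic check: uncorrelated noise versus its running sum (a random walk).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2048)]
walk = list(itertools.accumulate(noise))
alpha_noise = dfa_exponent(noise, [8, 16, 32, 64])
alpha_walk = dfa_exponent(walk, [8, 16, 32, 64])
```

An exponent near 0.5 indicates no long-term memory, and larger values indicate persistent correlations. Uncorrelated noise should land close to 0.5 and its running sum close to 1.5.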

    Scaling Laws in Human Language

    Zipf's law on word frequency is observed in English, French, Spanish, Italian, and so on, yet it does not hold for Chinese, Japanese or Korean characters. A model of the writing process is proposed to explain this difference, which takes into account the effects of finite vocabulary size. Experiments, simulations and the analytical solution agree well with each other. The results show that the frequency distribution follows a power law with an exponent equal to 1, at which the corresponding Zipf's exponent diverges. In fact, the distribution takes an exponential form in the Zipf plot. Deviating from Heaps' law, the number of distinct words grows with the text length in three stages: it grows linearly in the beginning, then turns to a logarithmic form, and eventually saturates. This work refines previous understanding of Zipf's law and Heaps' law in language systems. Comment: 6 pages, 4 figures.

    Niche as a determinant of word fate in online groups

    Patterns of word use both reflect and influence a myriad of human activities and interactions. Like other entities that are reproduced and evolve, words rise or decline depending upon a complex interplay between their intrinsic properties and the environments in which they function. Using Internet discussion communities as model systems, we define the concept of a word niche as the relationship between the word and the characteristic features of the environments in which it is used. We develop a method to quantify two important aspects of the size of the word niche: the range of individuals using the word and the range of topics it is used to discuss. Controlling for word frequency, we show that these aspects of the word niche are strong determinants of changes in word frequency. Previous studies have already indicated that word frequency itself is a correlate of word success at historical time scales. Our analysis of changes in word frequencies over time reveals that the relative sizes of word niches are far more important than word frequencies in the dynamics of the entire vocabulary at shorter time scales, as the language adapts to new concepts and social groupings. We also distinguish endogenous versus exogenous factors as additional contributors to the fates of words, and demonstrate the force of this distinction in the rise of novel words. Our results indicate that short-term nonstationarity in word statistics is strongly driven by individual proclivities, including inclinations to provide novel information and to project a distinctive social identity. Comment: Supporting Information is available here: http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0019009.s00

    How university’s activities support the development of students’ entrepreneurial abilities: case of Slovenia and Croatia

    The paper reports how the activities offered by universities support the development of students’ entrepreneurial abilities. Data were collected from 306 students from Slovenian and 609 students from Croatian universities. The study reduces the gap between theoretical research on academic entrepreneurship education and individual empirical studies of students’ assessment of the academic activities offered for the development of their entrepreneurial abilities. The empirical research revealed differences in Slovenian and Croatian students’ perception of (a) the academic activities needed and (b) the significance of the university activities offered for the development of their entrepreneurial abilities. Additionally, the results reveal that the impact of students’ gender and study level on their perception of the importance of the academic activities offered is not significant for most of the activities considered. The main practical implication concerns further improvement of universities’ entrepreneurship education programs through the selection and use of activities that can fill the recognized gaps between the academic activities students need and those offered for the development of their entrepreneurial abilities.

    The “ebb and flow” of student learning on placement

    There is a rise in interest in work-based learning as part of student choice at subject level in the UK (DOE 2017), but there remains an absence of specific guidance on how best to support higher education students learning on placement. An alternative HE experience in England, the degree apprenticeship, underlies the continued policy focus on securing placement experiences for students without stipulating the type of support that is required at the ‘coal face’ of work-based learning. Policy documents (UUK 2016) that urge universities to enter into partnership agreements with both employers and FE colleges to plug skills shortages are noticeably lacking in their appreciation of the unique qualities of work-based learning and how best to support students in this setting (Morley 2017a). Unfortunately, this is not unusual, as placements have predominantly been an enriching ‘add on’ to the real business of academic learning in more traditional university programmes. Support initiatives, such as that described in chapter 9, are a rare appreciation of the importance of this role. Undergraduate nursing programmes currently support a 50:50 split between practice learning in clinical placements and the theory delivered at universities. Vocational degrees such as these provide an interesting case study of how students can be supported in the practice environment through an appreciation of how students really learn on placement and how hidden resources can be utilised more explicitly for practice learning. During 2013–2015, a professional doctorate research project (Morley 2015) conducted a grounded theory study of 21 first-year student nurses on their first placement to discover how they learnt ‘at work’ and the strategies they enlisted to be successful work-based learners.

    Keywords and Co-Occurrence Patterns in the Voynich Manuscript: An Information-Theoretic Analysis

    The Voynich manuscript has so far remained a mystery for linguists and cryptologists. While the text, written on medieval parchment using an unknown script system, shows basic statistical patterns that bear resemblance to those from real languages, there are features that have suggested to some researchers that the manuscript was a forgery intended as a hoax. Here we analyse the long-range structure of the manuscript using methods from information theory. We show that the Voynich manuscript presents a complex organization in the distribution of words that is compatible with those found in real language sequences. We are also able to extract some of the most significant semantic word-networks in the text. These results, together with some previously known statistical features of the Voynich manuscript, support the presence of a genuine message inside the book.